 audio chunk


i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents

arXiv.org Artificial Intelligence

We experiment with a low-latency, end-to-end voice-to-voice communication model to optimize it for real-time conversational applications. By analyzing the components essential to a voice-to-voice (V-2-V) system, namely automatic speech recognition (ASR), text-to-speech (TTS), and dialog management, our work examines how to reduce processing time while maintaining high-quality interactions, and identifies the levers for optimizing a V-2-V system. We find that the TTS component, which generates life-like voice full of emotion, including natural pauses and exclamations, has the highest impact on the real-time factor (RTF). The V-2-V architecture we experimented with uses CSM1b, which can understand the tone as well as the context of a conversation by ingesting both the audio and the text of prior exchanges to generate contextually accurate speech. We explored optimizing the Residual Vector Quantization (RVQ) iterations performed by the TTS decoder, which comes at the cost of reduced quality in the generated voice. Our experimental evaluations also demonstrate that, for V-2-V implementations based on CSM, the most important optimizations come from reducing the number of RVQ iterations along with the codebooks used in Mimi.
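The abstract does not say how RTF is computed. A common definition, assumed here, is processing time divided by the duration of the audio produced, so that values below 1.0 mean speech is generated faster than it plays back. The function name and the numbers in this sketch are illustrative and are not taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by the duration of the audio produced.

    Assumed definition: RTF < 1.0 means faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    if processing_seconds < 0:
        raise ValueError("processing_seconds must be non-negative")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Hypothetical measurement: 0.6 s of compute for 2.0 s of generated speech.
    rtf = real_time_factor(0.6, 2.0)
    print(f"RTF = {rtf:.2f}")
```

Under this definition, any optimization that shortens TTS decoding time for the same length of output audio lowers the RTF.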


Automated Detection of Sport Highlights from Audio and Video Sources

arXiv.org Artificial Intelligence

This study presents a novel Deep Learning-based and lightweight approach for the automated detection of sports highlights (HLs) from audio and video sources. HL detection is a key task in sports video analysis, traditionally requiring significant human effort. Our solution leverages Deep Learning (DL) models trained on relatively small datasets of audio Mel-spectrograms and grayscale video frames, achieving promising accuracy rates of 89% and 83% for audio and video detection, respectively. The use of small datasets, combined with simple architectures, demonstrates the practicality of our method for fast and cost-effective deployment. Furthermore, an ensemble model combining both modalities shows improved robustness against false positives and false negatives. The proposed methodology offers a scalable solution for automated HL detection across various types of sports video content, reducing the need for manual intervention. Future work will focus on enhancing model architectures and extending this approach to broader scene-detection tasks in media analysis.
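The abstract does not describe how the ensemble combines the two modalities. One simple possibility, shown here purely as an assumption, is a weighted average of the per-segment highlight probabilities from the audio and video models, followed by a threshold. The function names, the weight, and the threshold are hypothetical.

```python
def ensemble_probability(audio_prob: float, video_prob: float,
                         audio_weight: float = 0.5) -> float:
    """Weighted average of two highlight probabilities (assumed fusion rule)."""
    for name, value in (("audio_prob", audio_prob),
                        ("video_prob", video_prob),
                        ("audio_weight", audio_weight)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return audio_weight * audio_prob + (1.0 - audio_weight) * video_prob


def is_highlight(audio_prob: float, video_prob: float,
                 threshold: float = 0.5, audio_weight: float = 0.5) -> bool:
    """Flag a segment as a highlight when the fused score reaches the threshold."""
    return ensemble_probability(audio_prob, video_prob, audio_weight) >= threshold


if __name__ == "__main__":
    # Hypothetical scores: the audio model is confident, the video model is not.
    print(is_highlight(audio_prob=0.9, video_prob=0.2))
```

Averaging means a single overconfident model cannot decide the outcome alone, which is one way an ensemble can reduce both false positives and false negatives.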


Whispy: Adapting STT Whisper Models to Real-Time Environments

arXiv.org Artificial Intelligence

Large general-purpose transformer models have recently become the mainstay in the realm of speech analysis. In particular, Whisper achieves state-of-the-art results in relevant tasks such as speech recognition, translation, language identification, and voice activity detection. However, Whisper models are not designed to be used in real-time conditions, and this limitation makes them unsuitable for a vast plethora of practical applications. In this paper, we introduce Whispy, a system intended to bring live capabilities to the Whisper pretrained models. As a result of a number of architectural optimisations, Whispy is able to consume live audio streams and generate high level, coherent voice transcriptions, while still maintaining a low computational cost. We evaluate the performance of our system on a large repository of publicly available speech datasets, investigating how the transcription mechanism introduced by Whispy impacts on the Whisper output. Experimental results show how Whispy excels in robustness, promptness, and accuracy.


Emotion Detection And Analysis

#artificialintelligence

Emotion Detection and Analysis is a web application developed by the team The Mystic Forces as their final project for the AI5: productionizing AI course at Univ.ai under the guidance of Pavlos Protopapas (Scientific Program Director at the Institute for Applied Computational Science (IACS) at Harvard University) & Shivas Jayaram (Research @Harvard IACS Deep Learning Researcher, Educator, and Practitioner). The web application is a deep learning project implemented end to end. Public speaking is not just a skill but an art that is not easily mastered, and it has become essential for everyone. In this digital world, where your office is your computer screen and online meeting platforms are the places to connect, one is asked to deliver presentations, give briefings, and hold meetings regularly.


Vehicle Sound Classification Using Deep Learning - Analytics Vidhya

#artificialintelligence

One of the most critical parameters of an audio signal is amplitude. Amplitude can be defined as the maximum displacement from the rest position. The rest position is sometimes also called the central position, as you can see in this diagram.
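The definition above can be illustrated with a short sketch that measures the peak amplitude of a sampled signal as its largest absolute displacement from the rest position. The helper names and the synthetic sine wave are illustrative assumptions and do not come from the article.

```python
import math


def sine_wave(amplitude: float, frequency_hz: float,
              sample_rate_hz: int, duration_s: float) -> list[float]:
    """Generate samples of a sine wave centred on a rest position of 0.0."""
    n_samples = int(sample_rate_hz * duration_s)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * i / sample_rate_hz)
        for i in range(n_samples)
    ]


def peak_amplitude(samples: list[float], rest_position: float = 0.0) -> float:
    """Return the maximum absolute displacement from the rest position."""
    if not samples:
        raise ValueError("samples must not be empty")
    return max(abs(s - rest_position) for s in samples)


if __name__ == "__main__":
    # Hypothetical signal: a 100 Hz tone sampled at 8 kHz for 0.1 s.
    wave = sine_wave(amplitude=0.8, frequency_hz=100,
                     sample_rate_hz=8000, duration_s=0.1)
    print(f"peak amplitude = {peak_amplitude(wave):.3f}")
```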


The AI that brought the Beatles and Cole Porter back to life

#artificialintelligence

It may sound like a lost track from The Beatles, but the catchy pop song, 'Daddy's Car', was composed by artificial intelligence (AI). The tune was created by Flow Machines, a system Sony taught to make music by feeding it 13,000 samples from different genres. Although the software is capable of creating the lead sheet, a human composer instructed it to produce a record in the style of The Beatles and wrote the lyrics.


The AI that brought The Beatles and Cole Porter back to life: Listen to Sony software that can create new songs in the style of any artist

Daily Mail - Science & tech

It may sound like a lost track from The Beatles, but the catchy pop song, 'Daddy's Car', was composed by artificial intelligence (AI). The tune was created by Flow Machines, a system Sony taught to make music by feeding it 13,000 samples from different genres. Although the software is capable of creating the lead sheet, a human composer instructed it to produce a record in the style of The Beatles and wrote the lyrics.